Skip to content

Wire authserver DCR resolver and add structured logs#5044

Merged
tgrunnagle merged 7 commits intomainfrom
issue_5039_dcr-2c
May 5, 2026
Merged

Wire authserver DCR resolver and add structured logs#5044
tgrunnagle merged 7 commits intomainfrom
issue_5039_dcr-2c

Conversation

@tgrunnagle
Copy link
Copy Markdown
Contributor

@tgrunnagle tgrunnagle commented Apr 23, 2026

Summary

  • Why: resolveDCRCredentials was wired up and unit-tested in the prior sub-issue, but EmbeddedAuthServer never called it, so any upstream configured with DCRConfig still booted without a client. This PR makes the resolver load-bearing and adds the operator-visible structured logs the Phase 2 spec calls for (cache-hit age, stale-registration warning, per-step errors, and a remediation hint when an upstream rejects a cached client as 4xx).
  • What:
    • EmbeddedAuthServer now owns an in-memory DCRCredentialStore and calls resolveDCRCredentials + consumeResolution for every OAuth2 upstream whose config requests DCR. buildPureOAuth2Config deliberately keeps its pure, network-free signature; the DCR-resolved ClientSecret is layered on afterwards via a new applyResolutionToOAuth2Config helper so the two application sites stay side-by-side.
    • Each upstream RunConfig element is shallow-copied and its OAuth2 sub-config is deep-copied before mutation, preserving the caller's RunConfig.Upstreams slice.
    • resolveDCRCredentials returns a new dcrStepError wrapping the originating step (metadata discovery, DCR call, store read, store write) instead of logging-and-returning at each branch. The boundary caller (buildUpstreamConfigs) emits one slog.Error via logDCRStepError, so there is a single error line per failure. A panic inside the singleflight closure is converted to the same dcrStepError carrying the captured debug.Stack(), so the boundary log surfaces it without an in-defer duplicate.
    • Cache-hit path emits slog.Debug with dcr_age_days and additionally slog.Warn with remediation text when the cached registration exceeds the 90-day dcrStaleAgeThreshold.
    • The /oauth/register handler success log emits at Debug with issuer, client_id, software_id, token_endpoint_auth_method, and scopes — silent at INFO+ per the project's logging convention. SoftwareID is threaded through ValidateDCRRequest, which now also caps it to 256 printable-ASCII characters.
    • MonitoredTokenSource accepts upstream and client_id at construction (replacing the earlier SetUpstreamContext setter, which had an unsynchronized writer). On the first transition to Unauthenticated driven by a permanent token-endpoint 4xx, it emits a single slog.Warn with a DCR/CIMD remediation hint (gated by stopOnce so a tight Token() loop cannot spam it, and suppressed when no client_id context is available). The classifier itself (isTransientNetworkError) is side-effect free.
    • NewHandler now fails loudly when AuthorizationServerConfig or its embedded *fosite.Config is nil; issuer() no longer carries a runtime nil-guard.
    • Error strings are run through a sanitizer that strips URL userinfo, query, and fragment components before logging. The regex matches http(s):// case-insensitively per RFC 3986 §3.1, and trailing sentence punctuation around a URL is preserved so prose is not mangled.
    • pkg/oauthproto/dcr.go handleHTTPResponse caps the upstream /register error body at 8 KiB via io.LimitReader so a non-conformant or malicious upstream cannot inflate operator log volume.

Closes #5039

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

New coverage added:

  • pkg/authserver/runner/embeddedauthserver_test.go — full DCR boot against a mock AS (metadata + /register), cache-hit short-circuit (zero additional HTTP requests on the second buildUpstreamConfigs), copy-before-mutate assertion on the caller's RunConfig.Upstreams slice, and TestNewEmbeddedAuthServer_DCRBoot driving the real constructor to assert dcrStore is populated.
  • pkg/authserver/runner/dcr_test.go — cache-hit Debug log with dcr_age_days, stale-age Warn above threshold, one dcrStepError-returning case per failure step, and the panic-recovery test now asserts the wrapped error is a *dcrStepError(dcrStepRegister, …, Stack: <captured>).
  • pkg/authserver/server/handlers/handler_chain_test.goTestHandler_issuer covers both populated and zero-value AccessTokenIssuer. TestNewHandler_ErrorsOnNilConfig (in authorize_test.go) pins the new constructor invariant.
  • pkg/authserver/server/registration/dcr_test.gosoftware_id length cap and non-printable-ASCII rejection.
  • pkg/auth/monitored_token_source_test.go — permanent-4xx branch emits the remediation Warn with workload/upstream/client_id populated via the new constructor parameters.
  • pkg/runner/runner_test.goTestRemoteAuthLogContext covers the full cached CIMD > cached DCR > static precedence.
  • Sanitizer unit tests cover commas, periods, closing parens, mixed runs, a Go http.Client-style quoted URL, embedded user:pass@host userinfo (with and without query), implicit-flow #access_token=… fragments, and uppercase HTTPS:// schemes.

Changes

File Change
pkg/authserver/runner/embeddedauthserver.go Add dcrStore field and wire resolver into buildUpstreamConfigs; introduce applyResolutionToOAuth2Config; document the per-instance store vs. process-wide singleflight asymmetry.
pkg/authserver/runner/dcr.go Introduce dcrStepError / logDCRStepError; emit cache-hit Debug, stale-age Warn, and per-step error context; carry panic stack on the wrapped error so a single boundary record surfaces it; URL-secret-stripping sanitizer (userinfo + query + fragment, case-insensitive scheme match).
pkg/authserver/runner/dcr_store.go Inline doc rationale (no path citations) for resource-leak, copy-before-mutate, and constructor-validation reasoning.
pkg/authserver/server/handlers/dcr.go DCR success log emits at Debug, drop redundant upstream attribute, thread software_id through.
pkg/authserver/server/handlers/handler.go NewHandler validates config and config.Config; remove the runtime nil-guard from issuer().
pkg/authserver/server/registration/dcr.go Cap software_id to 256 printable-ASCII characters; export MaxSoftwareIDLength.
pkg/auth/monitored_token_source.go Accept upstream/client_id as constructor parameters; emit DCR/CIMD remediation Warn from markAsUnauthenticated (once per state transition, gated by stopOnce, suppressed when clientID == ""). The classifier is side-effect free.
pkg/runner/runner.go Populate new MonitoredTokenSource fields from RemoteAuthConfig with cached CIMD > cached DCR > static precedence (mirrors remote.Handler.resolveClientCredentials).
pkg/oauthproto/dcr.go Cap the upstream /register error body at 8 KiB via io.LimitReader before embedding it in the returned error string.
pkg/authserver/runner/embeddedauthserver_test.go DCR boot, cache-hit, and copy-before-mutate integration tests.
pkg/authserver/runner/dcr_test.go Log and per-step error tests, sanitizer userinfo/fragment/uppercase cases, augmented panic-recovery assertions.
pkg/authserver/server/handlers/handler_chain_test.go, authorize_test.go TestHandler_issuer and TestNewHandler_ErrorsOnNilConfig.
pkg/authserver/server/handlers/dcr_test.go, pkg/authserver/server/registration/dcr_test.go Assertions for new log fields and software_id validation.
pkg/auth/monitored_token_source_test.go, pkg/runner/runner_test.go New constructor plumbing, remediation Warn, and the CIMD-wins precedence case (closes review finding F26).

Large PR Justification

The PR exceeds the 400-LOC / 10-file guideline (1342+/104- across 18 files), but the change cannot be split without leaving the codebase in an intermediate non-functional state:

  • ~660 lines are tests (*_test.go) — table-driven coverage for the resolver wiring, the new error taxonomy, the staleness + remediation logs, and the sanitizer.
  • The remaining ~510 production lines across 9 files implement one logical change (wire the resolver into the auth-server boot path) plus four tightly-coupled sub-concerns: resolver wiring, dcrStepError taxonomy + boundary logging, monitored-token-source remediation log, and software_id validation. Splitting the resolver wiring without the matching error taxonomy would land a regression in observability; splitting the error taxonomy without the wiring would orphan the new types.

Does this introduce a user-facing change?

Operators gain two new diagnostic signals on DCR-enabled upstreams:

  1. A Warn when a cached DCR registration is older than 90 days, hinting that the operator may want to rotate.
  2. A Warn on the first transition to Unauthenticated driven by a permanent token-endpoint 4xx, hinting that the workload's cached DCR or CIMD credentials may be stale and that deleting them and restarting will trigger re-registration. The Warn fires at most once per state transition and is suppressed entirely for workloads with no client_id context.

The MonitoredTokenSource constructor signature has changed to require upstream and client_id, but it is an internal API with no public surface, and the prior SetUpstreamContext setter is removed.

Successful DCR registrations log at Debug, not Info, in keeping with the project's silent-success-at-INFO+ convention.

Special notes for reviewers

  • The purity gate on buildPureOAuth2Config is load-bearing: it takes no context.Context and performs no network I/O. DCR resolution is deliberately sequenced around it, not inside it, so the pure config builder stays trivially testable. Please flag any future change that threads ctx into it.
  • dcrStepError moves the "one error line per failure" invariant from the resolver to the boundary caller. If a future caller forgets to invoke logDCRStepError, the error will still propagate normally but the structured log will be missing. Worth keeping in mind if we add more resolver call sites.
  • consumeResolution (formerly applyResolution) is a one-shot state transition, not an idempotent default-fill: a second call after the first is a no-op only because the first cleared DCRConfig. The companion applyResolutionToOAuth2Config writes the inline-only ClientSecret onto the built *upstream.OAuth2Config; any future output path from OAuth2UpstreamRunConfig to upstream.OAuth2Config must call both helpers to get a fully-resolved DCR client.
  • The package-level dcrFlight singleflight is intentionally process-wide while EmbeddedAuthServer.dcrStore is per-instance — see the doc comments on each. The asymmetry is load-bearing for concurrent EmbeddedAuthServer instances in the same process.

Generated with Claude Code

@tgrunnagle tgrunnagle changed the base branch from main to issue_5038_dcr-2b April 23, 2026 20:47
@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label Apr 23, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 23, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 74.11168% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.56%. Comparing base (b532156) to head (0073ad7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/authserver/runner/dcr.go 68.53% 26 Missing and 2 partials ⚠️
pkg/auth/monitored_token_source.go 71.79% 10 Missing and 1 partial ⚠️
pkg/runner/runner.go 0.00% 9 Missing ⚠️
pkg/authserver/runner/embeddedauthserver.go 83.33% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5044      +/-   ##
==========================================
+ Coverage   67.50%   67.56%   +0.05%     
==========================================
  Files         601      601              
  Lines       61055    61262     +207     
==========================================
+ Hits        41218    41391     +173     
- Misses      16721    16745      +24     
- Partials     3116     3126      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 27, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 28, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 29, 2026
tgrunnagle added a commit that referenced this pull request Apr 29, 2026
Closes the 14 review threads from review #4199898531 that are scoped to
this PR; #3 and #5 are intentionally deferred to PR #5044 which is the
PR that consumes the ClientSecret plumbing and RedirectURI propagation.

Behaviour fixes:

- #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now
  preserve the issuer's path component, so multi-tenant upstreams that
  ship DCR without advertising it (or this auth server when it itself
  has a tenant prefix in its issuer) keep their tenant prefix in the
  synthesised URL and the defaulted redirect URI.
- #7 resolveDCRCredentials wraps registration in a package-level
  singleflight.Group keyed by DCRKey so concurrent callers for the same
  upstream coalesce into a single RFC 7591 registration. The leader's
  closure rechecks the cache before any network I/O so followers that
  arrive just after a Put returns observe the fresh entry.
- #9 applyResolution writes ClientID only when the run-config copy
  leaves it empty, treating it symmetrically with the endpoint writes.
  This is defence-in-depth for callers that bypass the resolver's
  validateResolveInputs precondition check.
- #10 the cfg.RegistrationEndpoint short-circuit branch in
  resolveDCREndpoints validates URL well-formedness and HTTPS-or-
  loopback locally, so a non-HTTPS endpoint fails before
  performRegistration constructs a bearer-token transport for it.
- #13 endpointsFromMetadata validates discovered authorization_endpoint
  and token_endpoint before populating dcrEndpoints, so a self-
  consistent metadata document with http:// URLs cannot flow through
  to the auth-code or token-exchange path.

Doc / scope:

- #6 trust-assumption note added on DCRUpstreamConfig with a link to
  the follow-up SSRF-hardening issue (#5135) per the
  "Document Architectural Constraints" rule.
- #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what
  the in-memory store does and does NOT solve (cross-replica sharing,
  durability, cross-process write coordination).

Test hygiene:

- #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that
  added no meaningful coverage.
- #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in
  TestFetchAuthorizationServerMetadataFromURL, matching the sibling
  pattern in dcr_test.go.
- #12 new failingDCRStore test double + two cases asserting the
  cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache
  put") that operators see when a future Redis backend errors.
- #14 renamed the TestNeedsDCR case from
  "client_id with dcr (invalid but returns false)" to
  "client_id wins over dcr_config (defensive AND semantic)" so the
  case label describes the intentional defensive behaviour rather
  than reading like a bug report.
- #15 removed a //nolint:lll directive that was suppressing a non-
  existent violation (the line is 108 chars; the limit is 130).
- #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out
  16 goroutines doing alternating Put/Get against overlapping and
  disjoint keys, with a fail-fast deadline, so go test -race actually
  exercises the sync.RWMutex guard.

New tests cover each behaviour change:

- TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers
  pins exactly-one-registration under N concurrent callers.
- TestSynthesiseRegistrationEndpoint_PreservesIssuerPath /
  TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path-
  preservation fix.
- TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers
  the symmetric ClientID write.
- TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers
  the local registration-endpoint guard.
- TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers
  the metadata authz/token endpoint guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tgrunnagle added a commit that referenced this pull request Apr 29, 2026
Closes the 14 review threads from review #4199898531 that are scoped to
this PR; #3 and #5 are intentionally deferred to PR #5044 which is the
PR that consumes the ClientSecret plumbing and RedirectURI propagation.

Behaviour fixes:

- #2 synthesiseRegistrationEndpoint and resolveUpstreamRedirectURI now
  preserve the issuer's path component, so multi-tenant upstreams that
  ship DCR without advertising it (or this auth server when it itself
  has a tenant prefix in its issuer) keep their tenant prefix in the
  synthesised URL and the defaulted redirect URI.
- #7 resolveDCRCredentials wraps registration in a package-level
  singleflight.Group keyed by DCRKey so concurrent callers for the same
  upstream coalesce into a single RFC 7591 registration. The leader's
  closure rechecks the cache before any network I/O so followers that
  arrive just after a Put returns observe the fresh entry.
- #9 applyResolution writes ClientID only when the run-config copy
  leaves it empty, treating it symmetrically with the endpoint writes.
  This is defence-in-depth for callers that bypass the resolver's
  validateResolveInputs precondition check.
- #10 the cfg.RegistrationEndpoint short-circuit branch in
  resolveDCREndpoints validates URL well-formedness and HTTPS-or-
  loopback locally, so a non-HTTPS endpoint fails before
  performRegistration constructs a bearer-token transport for it.
- #13 endpointsFromMetadata validates discovered authorization_endpoint
  and token_endpoint before populating dcrEndpoints, so a self-
  consistent metadata document with http:// URLs cannot flow through
  to the auth-code or token-exchange path.

Doc / scope:

- #6 trust-assumption note added on DCRUpstreamConfig with a link to
  the follow-up SSRF-hardening issue (#5135) per the
  "Document Architectural Constraints" rule.
- #8 NewInMemoryDCRCredentialStore doc now explicitly calls out what
  the in-memory store does and does NOT solve (cross-replica sharing,
  durability, cross-process write coordination).

Test hygiene:

- #4 deleted the literal-equals-literal dcrStaleAgeThreshold test that
  added no meaningful coverage.
- #11 atomic.AddInt32 / LoadInt32 for wellKnownHits / customHits in
  TestFetchAuthorizationServerMetadataFromURL, matching the sibling
  pattern in dcr_test.go.
- #12 new failingDCRStore test double + two cases asserting the
  cache.Get / cache.Put wrap messages ("dcr: cache lookup", "dcr: cache
  put") that operators see when a future Redis backend errors.
- #14 renamed the TestNeedsDCR case from
  "client_id with dcr (invalid but returns false)" to
  "client_id wins over dcr_config (defensive AND semantic)" so the
  case label describes the intentional defensive behaviour rather
  than reading like a bug report.
- #15 removed a //nolint:lll directive that was suppressing a non-
  existent violation (the line is 108 chars; the limit is 130).
- #16 new TestInMemoryDCRCredentialStore_ConcurrentAccess fans out
  16 goroutines doing alternating Put/Get against overlapping and
  disjoint keys, with a fail-fast deadline, so go test -race actually
  exercises the sync.RWMutex guard.

New tests cover each behaviour change:

- TestResolveDCRCredentials_SingleflightCoalescesConcurrentCallers
  pins exactly-one-registration under N concurrent callers.
- TestSynthesiseRegistrationEndpoint_PreservesIssuerPath /
  TestResolveUpstreamRedirectURI_PreservesIssuerPath cover the path-
  preservation fix.
- TestApplyResolution_DoesNotOverwritePreProvisionedClientID covers
  the symmetric ClientID write.
- TestResolveDCREndpoints_DirectRegistrationEndpointValidated covers
  the local registration-endpoint guard.
- TestEndpointsFromMetadata_RejectsInsecureDiscoveredEndpoints covers
  the metadata authz/token endpoint guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tgrunnagle and others added 3 commits May 1, 2026 09:42
isTransientNetworkError previously emitted the cached-DCR-client
remediation Warn from inside its classifier on every permanent 4xx.
A tight Token() loop hitting the same condition would spam the same
record on every call before the workload's Unauthenticated status
propagated. The same branch also fired for non-DCR workloads, which
saw a remediation telling them to "delete cached credentials" they
never had.

Strip the side effect from the classifier and emit the Warn from
markAsUnauthenticated, which already gates the close-monitoring
transition through stopOnce. The Warn now fires at most once per
state transition, is suppressed when no client_id context is
available, and reads honestly about the variability ("if this
workload uses cached DCR or CIMD credentials they may be stale").

Addresses #5044 review finding F5 (MEDIUM).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two related polish items in the authserver handlers package:

NewHandler did not validate that AuthorizationServerConfig (or its
embedded *fosite.Config) was non-nil. issuer() reached into the
config at request time, so a misconfigured caller would panic deep
inside an HTTP handler instead of failing at startup. Add the nil
check to the constructor and simplify issuer() to rely on the
constructor invariant. Pin the new invariant with
TestNewHandler_ErrorsOnNilConfig.

The /oauth/register success log was promoted to Info even though the
operation is neither long-running nor exceptional. Demote back to
Debug; an audit-log path is the right home for the audit signal if
that becomes a requirement.

Addresses #5044 review findings F6 and F7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundles ten review-feedback items in pkg/authserver/runner; each
addresses internal-API hygiene or doc-comment drift in the DCR
resolver, no behaviour change for callers.

Sanitizer hardening (F1, F19, F18, F23): sanitizeErrorForLog now
strips userinfo and fragment in addition to the query, since either
can carry credentials or tokens (https://user:pass@host,
implicit-flow #access_token=...). queryStrippingPattern matches
http/https case-insensitively per RFC 3986 §3.1 so HTTPS://...
cannot escape sanitisation. trimURLTrailingPunctuation switches to
strings.IndexByte to match the ASCII-only terminator set without
the rune-decoding overhead. Test cases added for each.

Resolver error API (F4, F13, F20): the dcrStepRegister panic-recovery
branch no longer emits a duplicate slog.Error; the captured stack
travels with the wrapped *dcrStepError to the boundary log so a
single panic produces a single record. DCRStepError /
LogDCRStepError lowercased to dcrStepError / logDCRStepError since
no caller lives outside the package. logDCRStepError now no-ops on
nil err so the unknown-step branch cannot fire on a missing failure.

Resolution helpers (F11, F12, F25): applyResolution renamed to
consumeResolution to communicate that it is a one-shot state
transition (clearing DCRConfig is unconditional), not an idempotent
defaulting step. The applyResolutionToOAuth2Config doc now states
the paired-call invariant explicitly without referencing a specific
test.

Lifecycle docs (F21, F22): the per-instance dcrStore vs.
process-wide dcrFlight asymmetry is now stated on both sides, and
EmbeddedAuthServer.Close documents the future-Close hook for a
backend with handles.

Inline rules-file rationale (F24): production comments no longer
cite .claude/rules/... by path; the principle is inlined.

Addresses #5044 review findings F1, F4, F11, F12,
F13, F18, F19, F20, F21, F22, F23, F24, F25.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tgrunnagle
Copy link
Copy Markdown
Contributor Author

Body-only review items follow-up:

F2 (HIGH) — Addressed in 4d504fe. pkg/oauthproto/dcr.go was outside this PR's diff; the unbounded io.ReadAll on the upstream's /register error body is now wrapped in io.LimitReader(resp.Body, 8*1024), so a non-conformant upstream cannot inflate operator log volume. I kept the existing %s-into-fmt.Errorf shape rather than parsing as RFC 7591 JSON; the cap is the load-bearing fix here.

F8 / F9 / F10 / F26-F30 (test-gap cluster) — Deferred. F26 was closed mechanically by the new CIMD case in TestRemoteAuthLogContext that landed with the F3 fix. The remaining items (slog-handler tee, failure-path subtest, panic-recovery errors.As(*dcrStepError) augmentation, etc.) I'd like to land as a follow-up so this PR stays focused on the correctness/security fixes. Happy to open a tracking issue if useful.

Generated with Claude Code

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@github-actions github-actions Bot dismissed their stale review May 1, 2026 16:54

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@tgrunnagle tgrunnagle marked this pull request as ready for review May 1, 2026 16:54
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes — none blocking.

Comment thread pkg/auth/monitored_token_source.go Outdated
Comment thread pkg/oauthproto/dcr.go
Comment thread pkg/runner/runner.go Outdated
Three follow-up items from review:

isPermanentTokenEndpointError used to treat every non-5xx
RetrieveError as permanent, which meant the DCR remediation Warn
fired on transient back-pressure (408 Request Timeout, 429 Too Many
Requests). Narrow the classifier to an explicit 400 / 401 / 403
allowlist so the "delete the cached credentials and restart" hint
fires only on responses where stale cached client credentials are a
plausible cause. Retry behaviour is unaffected — the retry decision
is gated entirely by isTransientNetworkError, which already returns
false for the whole RetrieveError branch.

handleHTTPResponse in pkg/oauthproto/dcr.go now drains the response
body after the LimitReader-bounded ReadAll. Without this, an upstream
returning more than 8 KiB on /register error left the connection mid-
stream when the deferred Close fired, preventing TCP connection
reuse. Mirrors what the non-JSON content-type branch a few lines
down already does.

remoteAuthLogContext moved out of pkg/runner into pkg/auth/remote as
*Config.LogContext(), collocating it with resolveClientCredentials
so the cached-CIMD > cached-DCR > static precedence has one home.
Adding a fourth cached field in future updates both call sites in
one place. Tests moved to pkg/auth/remote/config_test.go.

Addresses #5044 review #4212656411 comments
3174540081, 3174540082, and 3174540086.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 1, 2026
@tgrunnagle tgrunnagle requested a review from jhrozek May 1, 2026 20:21
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified all three follow-ups landed in 0073ad7 and look right — the classifier allowlist, the body drain, and the LogContext relocation. Thanks for the quick turnaround.

@tgrunnagle tgrunnagle merged commit 9572f6a into main May 5, 2026
44 checks passed
@tgrunnagle tgrunnagle deleted the issue_5039_dcr-2c branch May 5, 2026 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Authserver DCR: wire resolver into authserver and add structured logs (Phase 2, Steps 2d/2g)

2 participants